A Dynamic Topic Model for Document Segmentation
Abstract
Factor language models, such as Latent Semantic Analysis, represent documents as mixtures of topics and have a variety of applications. Normally, the mixture is computed at the whole-document level: the entire document is treated as containing material on several topics, without specifying where in the document they occur. In this paper, we describe a new model that computes a topic mixture estimate for every word in each document. This model has a number of applications, but a natural one is topical document segmentation, which we explore in this paper. Topical segmentation breaks a document into passages such that each passage is mostly about a single topic and adjacent passages have different topics. Most previous work has started from an a priori segmentation (primarily multi-sentence passages), with the goal of merging the a priori segments into topic-based passages. Our method uses no a priori segmentation of the text and can mark boundaries anywhere (i.e., between any adjacent words), although it is more likely to do so at natural boundaries such as sentence and paragraph breaks. Our model accomplishes this fine-grained segmentation by computing a per-word topic mixture distribution. We first show that per-word mixture analysis is a natural extension of an earlier factor model (specifically, the Gamma-Poisson model). Next, we detail the computational efficiency of our model: it costs only slightly more than traditional per-document topic mixture methods. Finally, we present experimental results.
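The abstract does not give the Gamma-Poisson details, but the idea of segmenting by per-word topic mixtures can be illustrated with a sketch. This is not the authors' algorithm; it assumes the per-word topic distributions are already given and places a boundary wherever the mixtures of two adjacent words diverge sharply (here measured by Jensen-Shannon divergence, with a hypothetical threshold):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two topic mixtures."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def segment(word_topic_mixtures, threshold=0.1):
    """Place a boundary before word i+1 whenever the topic mixtures of
    words i and i+1 diverge by more than `threshold`."""
    boundaries = []
    for i in range(len(word_topic_mixtures) - 1):
        if js_divergence(word_topic_mixtures[i],
                         word_topic_mixtures[i + 1]) > threshold:
            boundaries.append(i + 1)
    return boundaries

# Toy example: 6 words, 2 topics, with a clear topic shift after word 2.
mixtures = [[0.9, 0.1], [0.85, 0.15], [0.9, 0.1],
            [0.1, 0.9], [0.15, 0.85], [0.1, 0.9]]
print(segment(mixtures))  # → [3]
```

In the paper's model, the boundary placement would instead fall out of the per-word mixture estimates themselves, which is why boundaries tend to land at natural sentence and paragraph breaks.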
Similar resources
A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics, and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
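The generative story described above can be sketched in a few lines. This is a toy illustration of LDA's document-generation step, not a fitted model: the vocabulary, topics, and the document's topic mixture `theta` are made up (in full LDA, `theta` would itself be drawn from a Dirichlet prior):

```python
import random

random.seed(0)

# Toy parameters (illustrative, not learned): two topics, each a
# distribution over a tiny vocabulary.
vocab = ["topic", "model", "page", "image"]
topics = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: about topic modeling
    [0.05, 0.05, 0.45, 0.45],  # topic 1: about page images
]

def generate_document(theta, n_words):
    """LDA's generative story for one document: for each word position,
    draw a topic z from the document's mixture theta, then draw the
    word from that topic's word distribution."""
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(topics)), weights=theta)[0]
        w = random.choices(vocab, weights=topics[z])[0]
        words.append(w)
    return words

doc = generate_document(theta=[0.8, 0.2], n_words=8)
print(doc)
```

Topic modeling then runs this process in reverse: given many such documents, it infers the topics and each document's mixture.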
A New Document Embedding Method for News Classification
Abstract: Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news is spread on the web; a text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis, which segments the document image into a set of regions. In the high-resolution page segmentation, each region is segmented into homogeneous regions by connected-component analysis and identifyi...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...
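The TF-IDF representation mentioned above can be computed in a few lines. This is a minimal illustration of the standard weighting (raw term frequency times the log inverse document frequency), not the cited paper's pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by its frequency within the document times the
    inverse of how many documents in the corpus contain it."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return vectors

docs = [["news", "web", "news"], ["web", "page"], ["news", "classifier"]]
vecs = tfidf_vectors(docs)
# "news" appears in 2 of 3 documents, so its idf factor is log(3/2);
# "page" appears in only 1 of 3, so it gets the larger factor log(3).
```

The shortcoming noted above is visible here: two documents share weight only on exact word matches, so synonyms contribute nothing, which is what motivates semantic vector representations.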
Bilingual Segmented Topic Model
This study proposes the bilingual segmented topic model (BiSTM), which hierarchically models documents by treating each document as a set of segments, e.g., sections. While previous bilingual topic models, such as bilingual latent Dirichlet allocation (BiLDA) (Mimno et al., 2009; Ni et al., 2009), consider only cross-lingual alignments between entire documents, the proposed model considers cros...